Introduction to Enron Dataset
Enron was one of the largest companies in the United States. Its collapse into bankruptcy as a result of corporate fraud was one of the largest bankruptcies in U.S. history. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives.
The main goal is to identify persons of interest (POIs) using supervised machine learning. The model classifies whether an individual is a POI or not, using the remaining features in the dataset, and several machine learning algorithms are compared.
In [1]:
#!/usr/bin/python
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
### Task 1: Select what features you'll use.
### features_list is a list of strings, each of which is a feature name.
### The first feature must be "poi".
financial_features = ['salary', 'deferral_payments', 'total_payments',
                      'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income',
                      'total_stock_value', 'expenses', 'exercised_stock_options', 'other',
                      'long_term_incentive', 'restricted_stock', 'director_fees']  # all units are in US dollars
email_features = ['to_messages', 'from_poi_to_this_person', 'email_address',
                  'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']
# units are generally number of email messages;
# notable exception is 'email_address', which is a text string
poi_label = ['poi']  # boolean, represented as integer
features_list = poi_label + financial_features + email_features
### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)
In [2]:
import pandas as pd
import numpy as np
enron = pd.DataFrame.from_dict(data_dict, orient='index')
print "total poi in dataset:", sum(enron['poi'] == 1)
#enron.describe()
#We can see all the null values for each index
In [3]:
enron = enron.replace('NaN', np.nan)
print(enron.info())
enron.describe()
Out[3]:
From the above information about the dataset we can conclude that any feature with fewer than 73 non-null values out of 146 is missing more than 50% of its data. The following features fall into that group:
Feature | Non-null values (out of 146) |
---|---|
deferral_payments | 39 |
restricted_stock_deferred | 18 |
loan_advances | 4 |
director_fees | 17 |
deferred_income | 49 |
long_term_incentive | 66 |
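The counts above can also be derived from the raw dictionary, where missing values are stored as the string 'NaN'. The following is a minimal sketch on invented records; the helper name `non_null_counts` and the toy values are assumptions for illustration and are not part of the project code.

```python
def non_null_counts(data_dict):
    """Count, per feature, how many records hold a value other than the string 'NaN'."""
    counts = {}
    for record in data_dict.values():
        for feature, value in record.items():
            counts.setdefault(feature, 0)
            if value != 'NaN':
                counts[feature] += 1
    return counts


# Toy records in the same shape as data_dict (all values are invented).
toy = {
    'PERSON A': {'salary': 100, 'loan_advances': 'NaN', 'poi': False},
    'PERSON B': {'salary': 'NaN', 'loan_advances': 'NaN', 'poi': True},
    'PERSON C': {'salary': 300, 'loan_advances': 50, 'poi': False},
    'PERSON D': {'salary': 250, 'loan_advances': 'NaN', 'poi': False},
}

counts = non_null_counts(toy)
threshold = len(toy) / 2.0  # features below this count are more than 50% missing
mostly_missing = sorted(f for f, n in counts.items() if n < threshold)
print(mostly_missing)
```

With 146 records the same threshold works out to 73, which is the cut-off used in the table.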
In [4]:
# Now check these features for POIs only, to see how much POI data is missing.
missing = ['loan_advances', 'director_fees', 'restricted_stock_deferred',
           'deferral_payments', 'deferred_income', 'long_term_incentive']
enron_poi=enron[enron['poi']==1][missing]
enron_poi.info()
In [5]:
# It is better to remove the features with the fewest non-null values
removing = ['loan_advances', 'director_fees', 'restricted_stock_deferred']
for x in removing:
    if x in features_list:
        features_list.remove(x)
features_list
Out[5]:
In [6]:
### Task 2: Remove outliers
# Visualising the outlier
import matplotlib.pyplot
# Keep only rows where both values are present; comparing with != np.nan is always True
e = enron[enron.total_payments.notnull() & enron.total_stock_value.notnull()]
matplotlib.pyplot.scatter(x="total_payments", y="total_stock_value", data=e)
matplotlib.pyplot.xlabel("total_payments")
matplotlib.pyplot.ylabel("total_stock_value")
matplotlib.pyplot.show()
In [7]:
# Finding the outlier: the row with the largest total_payments
enron.total_payments.idxmax()
Out[7]:
In [8]:
# Dropping TOTAL, which must be a spreadsheet artifact rather than a person
enron = enron.drop("TOTAL")
In [9]:
#data_dict.pop( 'TOTAL', 0 )
# Keep only rows where both values are present; comparing with != np.nan is always True
e = enron[enron.total_payments.notnull() & enron.total_stock_value.notnull()]
matplotlib.pyplot.scatter(x="total_payments", y="total_stock_value", data=e)
matplotlib.pyplot.xlabel("total_payments")
matplotlib.pyplot.ylabel("total_stock_value")
matplotlib.pyplot.show()
enron.total_payments.idxmax()
Out[9]:
LAY KENNETH L is the next outlier, but it is a valid data point.
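The same check can be made on the raw dictionary. Below is a minimal sketch, using invented records and a hypothetical helper `largest_by`, of how the record with the largest `total_payments` is located while skipping 'NaN' entries. It is an illustration only, not the code used in this project.

```python
def largest_by(data_dict, feature):
    """Return the key whose value for `feature` is largest, ignoring 'NaN'."""
    present = {name: rec[feature] for name, rec in data_dict.items()
               if rec.get(feature, 'NaN') != 'NaN'}
    return max(present, key=present.get)


# Toy records (values are invented); 'TOTAL' mimics an aggregate row.
toy = {
    'PERSON A': {'total_payments': 1000},
    'PERSON B': {'total_payments': 'NaN'},
    'PERSON C': {'total_payments': 5000},
    'TOTAL': {'total_payments': 6000},
}

first = largest_by(toy, 'total_payments')
toy.pop(first)  # drop the aggregate row, then look again
second = largest_by(toy, 'total_payments')
print("largest: %s, next largest: %s" % (first, second))
```

Whether the next largest record should also be dropped is a judgment call: an aggregate row is not a person, whereas a real individual with extreme values is a valid point.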
In [10]:
# According to insiderpay.pdf, THE TRAVEL AGENCY IN THE PARK is not a person, so it has to be removed.
enron = enron.drop("THE TRAVEL AGENCY IN THE PARK")
In [11]:
enron[enron[financial_features].isnull().all(axis=1)].index
Out[11]:
In [12]:
# There is one person without any financial data, who also needs to be removed.
enron = enron.drop('LOCKHART EUGENE E')
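For reference, the "no financial data at all" check can be sketched on the raw dictionary as well. The helper `records_missing_all`, the shortened feature list, and the records below are all assumptions made up for this illustration.

```python
def records_missing_all(data_dict, features):
    """Return sorted names of records in which every listed feature is 'NaN'."""
    return sorted(
        name for name, record in data_dict.items()
        if all(record.get(f, 'NaN') == 'NaN' for f in features)
    )


toy_financial = ['salary', 'bonus']  # stand-in for financial_features
toy = {
    'PERSON A': {'salary': 100, 'bonus': 'NaN', 'to_messages': 12},
    'PERSON B': {'salary': 'NaN', 'bonus': 'NaN', 'to_messages': 40},
    'PERSON C': {'salary': 'NaN', 'bonus': 7, 'to_messages': 'NaN'},
}

empty = records_missing_all(toy, toy_financial)
print(empty)
```

A record is only flagged when every financial feature is missing, so a person with a single known value is kept.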
In [13]:
enron = enron.replace(np.nan, 'NaN')  # convert back to 'NaN' strings, which the tester code expects
data_dict = enron[features_list].to_dict(orient = 'index')
In [14]:
from tester import test_classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
# email_address is a text string, so it is not passed to the classifier
if "email_address" in features_list:
    features_list.remove("email_address")
feat = features_list
test_classifier(clf, data_dict, feat)
In [15]:
### Task 3: Create new feature(s)
### Store to my_dataset for easy export below.
my_dataset = data_dict
In [16]:
# Adding three new features
for key, value in my_dataset.items():
    if value['from_messages'] == 'NaN' or value['from_this_person_to_poi'] == 'NaN':
        value['person_to_poi/total_msgs'] = 0.0
    else:
        value['person_to_poi/total_msgs'] = value['from_this_person_to_poi'] / (1.0*value['from_messages'])
    if value['to_messages'] == 'NaN' or value['from_poi_to_this_person'] == 'NaN':
        value['poi_to_person/to_msgs'] = 0.0
    else:
        value['poi_to_person/to_msgs'] = value['from_poi_to_this_person'] / (1.0*value['to_messages'])
    if value['shared_receipt_with_poi'] == 'NaN' or value['from_poi_to_this_person'] == 'NaN' \
            or value['from_this_person_to_poi'] == 'NaN':
        value['total_poi_interaction'] = 0.0
    else:
        value['total_poi_interaction'] = value['shared_receipt_with_poi'] + \
            value['from_this_person_to_poi'] + \
            value['from_poi_to_this_person']
features_new_list = features_list + ['person_to_poi/total_msgs', 'poi_to_person/to_msgs', 'total_poi_interaction']
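The two ratio features follow the same pattern, so the logic can be factored into a small helper. This is a sketch only: the function name `ratio` is hypothetical, the record is invented, and the guard against a zero denominator is an extra assumption that the cell above does not include.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator as a float.

    Falls back to 0.0 when either value is the string 'NaN'
    or when the denominator is zero (the latter is an added safeguard).
    """
    if numerator == 'NaN' or denominator == 'NaN' or denominator == 0:
        return 0.0
    return float(numerator) / denominator


# Invented record in the same shape as one entry of my_dataset.
record = {'from_messages': 20, 'from_this_person_to_poi': 5,
          'to_messages': 'NaN', 'from_poi_to_this_person': 3}

record['person_to_poi/total_msgs'] = ratio(record['from_this_person_to_poi'],
                                           record['from_messages'])
record['poi_to_person/to_msgs'] = ratio(record['from_poi_to_this_person'],
                                        record['to_messages'])
print(record['person_to_poi/total_msgs'])
print(record['poi_to_person/to_msgs'])
```

Using a fraction rather than a raw count means that a person who sends few emails overall, but most of them to POIs, is not hidden behind heavy emailers.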
In [17]:
# SelectKBest is used to rank the features
data = featureFormat(my_dataset, features_new_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, chi2
selector = SelectKBest(k='all').fit(features, labels)
results = pd.DataFrame(selector.scores_,
index=features_new_list[1:])
results.columns = ['Importances']
results = results.sort_values('Importances', ascending=False)
results
Out[17]:
In [18]:
from tester import test_classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
# email_address is a text string, so it is not passed to the classifier
if "email_address" in features_new_list:
    features_new_list.remove("email_address")
feat = features_new_list
test_classifier(clf, my_dataset, feat)
In [19]:
### Task 4: Try a variety of classifiers
### Please name your classifier clf for easy export below.
### Note that if you want to do PCA or other multi-stage operations,
### you'll need to use Pipelines.
# Provided to give you a starting point. Try a variety of classifiers.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, chi2
from sklearn.feature_selection import RFE
from sklearn import tree
from sklearn.svm import SVC, SVR
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.cross_validation import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
#from sklearn.model_selection import RandomizedSearchCV
In [20]:
parameters = {}
parameters["DecisionTreeClassifier"] = [{'min_samples_split': [2, 3],
                                         'criterion': ['entropy']}]
parameters["GaussianNB"] = [{'selection__k': [9, 10, 11],
                             'pca__n_components': [2, 3, 4, 5]}]
parameters["SVC"] = [{'selection__k': [11],
                      'svc__kernel': ['rbf', 'sigmoid'],
                      'svc__C': [x / 1.0 for x in range(1, 100, 10)],
                      'svc__gamma': [0.1 ** x for x in range(1, 9)]}]
parameters["AdaBoostClassifier"] = [{
    'base_estimator': [DecisionTreeClassifier(min_samples_split=2, criterion='entropy')],
    'learning_rate': [x / 30.0 for x in range(1, 30)],
    'n_estimators': range(1, 100, 20),
    'algorithm': ['SAMME', 'SAMME.R']}]
parameters["KNeighborsClassifier"] = [{
    'selection__k': [10, 11], 'knn__p': range(3, 4),
    'pca__n_components': [2, 3, 4, 5], 'knn__n_neighbors': range(1, 10),
    'knn__weights': ['uniform', 'distance'],
    'knn__algorithm': ['ball_tree', 'kd_tree', 'brute']}]
pipe = {}
pipe["DecisionTreeClassifier"] = DecisionTreeClassifier()
pipe["GaussianNB"] = Pipeline([('scaler', MinMaxScaler()), ('selection', SelectKBest()),
                               ('pca', PCA()), ('naive_bayes', GaussianNB())])
pipe["SVC"] = Pipeline([('selection', SelectKBest()), ('scaler', StandardScaler()),
                        ('svc', SVC())])
pipe["AdaBoostClassifier"] = AdaBoostClassifier()
pipe["KNeighborsClassifier"] = Pipeline([('selection', SelectKBest()),
                                         ('pca', PCA()), ('knn', KNeighborsClassifier())])
In [21]:
### Task 5: Tune your classifier to achieve better than .3 precision and recall
### using our testing script. Check the tester.py script in the final project
### folder for details on the evaluation method, especially the test_classifier
### function. Because of the small size of the dataset, the script uses
### stratified shuffle split cross validation. For more info:
### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html
data = featureFormat(my_dataset, features_new_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
# Example starting point. Try investigating other evaluation techniques!
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.3, random_state=42)
In [22]:
clf=DecisionTreeClassifier()
clf_name = clf.__class__.__name__
grid = GridSearchCV(estimator=pipe[clf_name],
                    param_grid=parameters[clf_name],
                    cv=StratifiedKFold(labels_train, n_folds=6),
                    n_jobs=-1, scoring='f1')
grid.fit(features_train, labels_train)
print clf_name
test_classifier(grid.best_estimator_, my_dataset, features_new_list)
In [23]:
clf=SVC()
clf_name = clf.__class__.__name__
grid = GridSearchCV(estimator=pipe[clf_name],
                    param_grid=parameters[clf_name],
                    cv=StratifiedKFold(labels_train, n_folds=6),
                    n_jobs=-1, scoring='f1')
grid.fit(features_train, labels_train)
print clf_name
test_classifier(grid.best_estimator_, my_dataset, features_new_list)
In [24]:
clf=AdaBoostClassifier()
clf_name = clf.__class__.__name__
grid = GridSearchCV(estimator=pipe[clf_name],
                    param_grid=parameters[clf_name],
                    cv=StratifiedKFold(labels_train, n_folds=6),
                    n_jobs=-1, scoring='f1')
grid.fit(features_train, labels_train)
print clf_name
test_classifier(grid.best_estimator_, my_dataset, features_new_list)
In [25]:
clf=GaussianNB()
clf_name = clf.__class__.__name__
grid = GridSearchCV(estimator=pipe[clf_name],
                    param_grid=parameters[clf_name],
                    cv=StratifiedKFold(labels_train, n_folds=6),
                    n_jobs=-1, scoring='f1')
grid.fit(features_train, labels_train)
print clf_name
test_classifier(grid.best_estimator_, my_dataset, features_new_list)
In [26]:
clf=KNeighborsClassifier()
clf_name = clf.__class__.__name__
grid = GridSearchCV(estimator=pipe[clf_name],
                    param_grid=parameters[clf_name],
                    cv=StratifiedKFold(labels_train, n_folds=6),
                    n_jobs=-1, scoring='f1')
grid.fit(features_train, labels_train)
print clf_name
test_classifier(grid.best_estimator_, my_dataset, features_new_list)
Several different classifiers were tried: decision tree, SVC, AdaBoost, Gaussian naive Bayes, and k-nearest neighbors.
Accuracy did not change much between classifiers. K-nearest neighbors had the highest accuracy and the highest precision, while the decision tree had the highest recall. The precision scores for k-nearest neighbors were really high. Unfortunately its recall scores were not as high, and recall was almost equal across the classifiers. The F1 score was highest for the KNN classifier.
In [27]:
### Task 6: Dump your classifier, dataset, and features_list so anyone can
### check your results. You do not need to change anything below, but make sure
### that the version of poi_id.py that you submit can be run on its own and
### generates the necessary .pkl files for validating your results.
dump_classifier_and_data(clf, my_dataset, features_list)
What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier)
Tuning the parameters of an algorithm means adjusting them to optimize its performance and get the best output. It is mainly about balancing the bias-variance trade-off:
High bias model | High variance model |
---|---|
Lacks the capacity to learn from the data and in practice ignores it. | Good at recreating data it has seen before, but cannot generalize to new data. |
We need a model with low bias and low variance to get the best output. Not every parameter needs to be tuned, but a poorly tuned model may underfit or overfit the data, so certain parameters of a given estimator need tuning.
Gaussian NB does not have any parameters for tuning. For the other models, tuning is done with GridSearchCV.
GridSearchCV is a systematic way of working through combinations of parameter values, cross-validating as it goes to determine which combination gives the best performance.
For a given machine learning algorithm there are different tuning parameters that affect performance, and GridSearchCV is used to find good values for them. Given a few tuning parameters, each with a list of possible values, it tries every combination. Grid search returns the parameters that maximize the score, which optimizes the learning algorithm.
The grid search used stratified k-fold cross-validation (k = 6) so that the train and test subsets were balanced over the target POI class. My goal was to maximize both precision and recall, so I chose 'f1' as the scoring parameter for model selection.
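To make the idea of "trying every combination" concrete, here is a minimal sketch of what a grid search does internally. It is not scikit-learn's implementation: `expand_grid` is a hypothetical helper, and `toy_score` is an invented stand-in for the cross-validated F1 score that a real search would compute by fitting a model.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def toy_score(params):
    """Invented scoring rule; a real search would cross-validate a model here."""
    penalty = 0 if params['criterion'] == 'entropy' else 1
    return -abs(params['min_samples_split'] - 3) - penalty


grid = {'min_samples_split': [2, 3, 4], 'criterion': ['gini', 'entropy']}
candidates = list(expand_grid(grid))      # 3 x 2 = 6 combinations
best = max(candidates, key=toy_score)     # keep the highest-scoring combination
print(len(candidates))
print(best)
```

The number of candidates is the product of the list lengths, which is why large grids such as the SVC and AdaBoost grids in this project take noticeably longer to search.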
Validation is the process of estimating a model's performance on unseen data. A major problem with machine learning algorithms is underfitting or overfitting the data, so the most important thing is to know how the model will generalize to unseen data.
We split our data into training data and unseen testing data to measure the performance of the model. Take the k-fold technique as an example.
k-fold technique:
First we divide the data set into k parts. A single part is selected as the test set and the remaining k-1 parts are taken as the training set.
We run k different learning experiments; every part acts as the testing set once and as part of the training set k-1 times.
Assume the data set a is divided into k parts a1, a2, ..., ak. In each of the k experiments you pick one of the subsets, looping from a1 to ak, as testing data and use the rest as training data.
We average the test performance over these runs (i.e., we train our machine learning algorithm and test it on the held-out test set each time).
Note that k-fold uses all the data for testing as well as all the data for training, so the average of these test results gives a more reliable score.
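The k-fold steps above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (contiguous folds, no shuffling) and not the scikit-learn implementation; the names `k_fold_indices` and `average_score` are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; yield (train, test) index lists."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def average_score(n_samples, k, score_fn):
    """Run score_fn(train, test) once per fold and return the mean score."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / float(len(scores))


splits = list(k_fold_indices(10, 3))
for train, test in splits:
    print("train=%s test=%s" % (train, test))
```

Every index appears in exactly one test fold and in the training set of all the other folds, which is the property the last step relies on.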
In grid search there is a chance of overfitting the validation set, since it is used many times to evaluate different points on the grid and the point that delivered the best performance is chosen.
Without k-fold cross-validation, the risk is higher that grid search will select parameter combinations that perform very well on one specific train-test split but poorly on others.
We used stratified k-fold cross-validation. This cross-validation object is a variation of KFold that returns stratified folds: the folds are made by preserving the percentage of samples for each class. Stratification ensures that each train and test set has a balanced proportion of the target class. Hence the ratio of POIs to non-POIs is approximately the same in every fold, which gives a more reliable estimate.
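Stratification can be illustrated by dealing the members of each class out to the folds in turn. The sketch below is a simplified stand-in, not scikit-learn's StratifiedKFold algorithm; the function name `stratified_folds` and the labels are invented for the example.

```python
def stratified_folds(labels, k):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


# Invented labels: 3 POIs (1) among 12 people, i.e. a 1:3 ratio.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
folds = stratified_folds(labels, 3)
poi_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(poi_per_fold)
```

Each fold ends up with the same POI to non-POI ratio as the full label list. With a rare class like POI, an unstratified split could easily produce a fold with no POIs at all, in which case precision and recall on that fold are undefined.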
I would like to look at the performance of the decision tree classifier.
$$ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Predictions}} $$
$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$
$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$
Context | Accuracy | Precision | Recall |
---|---|---|---|
After Data exploration | 0.79813 | 0.22947 | 0.21800 |
After feature selection | 0.81767 | 0.31144 | 0.30350 |
After tuning parameters | 0.828 | 0.35644 | 0.36000 |
The decision tree classifier identifies a person as POI or non-POI with an accuracy of 0.83, with precision and recall scores of 0.356 and 0.36. A precision of 0.356 means that 35.6% of the people the model flags as POIs are actual POIs, while the other 64.4% are false positives. A recall of 0.36 means that if a POI is present in the test set, the classifier correctly labels it as a POI 36% of the time.
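The accuracy, precision, and recall formulas above translate directly into code. The sketch below uses invented confusion-matrix counts purely to show the arithmetic; they are not the counts produced by tester.py, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / float(total)
    # Guard against division by zero when nothing is predicted or present as POI.
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Invented counts: 100 people flagged as POI, of whom 30 really are POIs.
acc, prec, rec = classification_metrics(tp=30, fp=70, tn=850, fn=50)
print("accuracy=%.3f precision=%.3f recall=%.3f" % (acc, prec, rec))
```

The example also shows why accuracy alone is misleading on this dataset: because non-POIs dominate, accuracy stays high even when most of the flagged people are false positives.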